Distinct Elements is Hard Even for Random Data Streams

Author

  • David Woodruff
Abstract

We continue the study of approximating the number of distinct elements in the data stream model to within a (1 ± ε) factor with constant probability. It was shown by Indyk and Woodruff (FOCS, 2003) that if the stream may consist of arbitrary data arriving to the streaming algorithm in an arbitrary order, then any 1-pass algorithm requires Ω(1/ε²) bits of space to perform this task. In an attempt to bypass this lower bound, Chakrabarti et al. (STOC, 2008) define a robust streaming model in which the stream may consist of arbitrary data, but it arrives to the algorithm in a random order. However, even in this model the authors were able to show an Ω(1/ε²) lower bound. This is because the adversary can still choose the data arbitrarily. We take this a step further and show that even with random data and a random ordering, there is an Ω(1/ε²) lower bound. This holds even if each successive stream item is drawn independently and uniformly at random from a subset of items. Our result subsumes all previous ones and shows that this potentially practical assumption of random data does not help. Another relaxation of the problem is to allow multiple passes over the data stream. We also show that even with random data, if in each pass a linear sketch over GF(2) of the data is computed (possibly adaptively), then there is an Ω(1/ε²) lower bound. Known approximation algorithms compute such sketches. Previously, nothing better than an Ω(1/ε) bound was known for any class of multi-pass algorithms in any of the above data stream models. Our bound applies to all such models.
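
To make the task in the abstract concrete, the following is a minimal sketch, not the paper's construction (the paper proves a lower bound and this listing gives no algorithm). It assumes a standard k-minimum-values estimator with k set to about 1/ε², and a stream whose items are drawn independently and uniformly from a fixed subset, as in the random-data model described above. The names `hash01`, `random_stream`, and `kmv_estimate` are hypothetical and chosen only for illustration.

```python
import hashlib
import heapq
import math
import random


def hash01(item):
    """Map an item to a pseudo-random float in [0, 1) using SHA-256."""
    digest = hashlib.sha256(str(item).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def random_stream(subset, length, seed=0):
    """Assumed model: each item is drawn i.i.d. uniformly from `subset`."""
    rng = random.Random(seed)
    return [rng.choice(subset) for _ in range(length)]


def kmv_estimate(stream, eps):
    """One-pass k-minimum-values estimate of the number of distinct items.

    Keeps the k = ceil(1/eps^2) smallest distinct hash values, so the
    number of stored values grows like 1/eps^2.
    """
    k = max(2, math.ceil(1.0 / eps**2))
    heap = []     # negated hashes: heap[0] is minus the largest kept hash
    kept = set()  # the same hashes, for fast duplicate checks
    for item in stream:
        h = hash01(item)
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        # Fewer than k distinct hashes were seen, so the count is exact.
        return float(len(heap))
    # Otherwise estimate from the k-th smallest hash value.
    return (k - 1) / (-heap[0])


if __name__ == "__main__":
    stream = random_stream(list(range(5000)), 50000, seed=1)
    print("true distinct:", len(set(stream)))
    print("estimate:", kmv_estimate(stream, eps=0.1))
```

The sketch stores on the order of 1/ε² hash values, which is the scale the lower bounds in the abstract say cannot be avoided, even when the data itself is random.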


Similar resources

Data Streams as Random Permutations: the Distinct Element Problem

We illustrate this by introducing RECORDINALITY, an algorithm which estimates the number of distinct elements in a stream by counting the number of k-records occurring in it. The algorithm has a score of interesting properties, such as providing a random sample of the set underlying the stream. To the best of our knowledge, a modified version of RECORDINALITY is the first cardinality estimation...

Presenting a Dynamic Method for Answering Ad Hoc Continuous Aggregate Queries

Data streams are unbounded, fast sequences of time-stamped data elements that arrive in bursts. Generally, these elements need to be processed in an online, real-time way, so algorithms that process data streams and answer queries on them are mostly one-pass. Running such algorithms raises challenges such as memory limitation, scheduling, and accuracy of answers. They will be more ...

Elements of metacommunity structure in Amazonian Zygoptera among streams under different spatial scales and environmental conditions

An important aspect of conservation is to understand the founding elements and characteristics of metacommunities in natural environments, and the consequences of anthropogenic disturbance on these patterns. In natural Amazonian environments, the interfluves of the major rivers play an important role in the formation of areas of endemism through the historical isolation of species and the speci...

An Analysis of Dialogism in Mikhail Bakhtin’s Thought: Convergence of Philosophy and Methodology

Undoubtedly, the twentieth century can be regarded as one of the richest periods of the history of philosophy and thought which globalized this tradition, generally, because of the spread of mass media and even the published books, and joined all vast and narrow streams, here and there, together and at their juncture a big sea is formed which is the most important gain of the century. One of t...

Guest Editor Introduction: Special Section on Online Analysis and Querying of Continuous Data Streams

In a number of application domains, data arrives continuously in the form of a stream and needs to be processed in an online fashion. For example, in the network installations of large Telecom and Internet service providers, detailed usage information (e.g., Call Detail Records or CDRs, IP traffic statistics due to SNMP/RMON polling, etc.) from different parts of the network needs to be continu...

Publication date: 2008